INFORMATION CAPACITY OF BIOLOGICAL MACROMOLECULAE
Information capacity of a symbol sequence is a measure of the unexpectedness of a continuation
of a given string of symbols. The continuation of a string is determined through the maximum entropy of
the reconstructed frequency dictionary; the capacity, in turn, is determined through the calculation
of the mutual entropy of the real frequency dictionary of a sequence with respect to the reconstructed one.
The capacity does not depend on the length of strings in a dictionary. The capacity calculated for
various genomes exhibits a multi-minima pattern reflecting an order observed within a sequence.

PACS numbers: 87.10.+e, 87.14.Gg, 87.15.Cc, 02.50.-r 



I. INTRODUCTION 

The analysis of statistical patterns in completely sequenced genomes is of
great interest. The correlations observed within the latter reflect some
biological features of primary structures. In particular, the sequence
periodicity of 3 base pairs (bp) indicates the presence of protein coding
regions in a genome; more exactly, non-coding regions are invariant with
respect to a frame shift of the codon pattern, while the coding ones lack
this invariance.

The complexity of patterns observed in a genetic sequence may vary
significantly. The complexity itself is a matter of interest to
mathematicians, biologists and biophysicists. Screening a genome with respect
to the complexity of its different fragments, a researcher may find various
biologically important peculiarities in a nucleotide sequence. Information
capacity measurements bring, in turn, new knowledge of the genetic entities.
Here we present a new approach to determine the information capacity of a
symbol sequence, with applications to genetic sequences.

To begin with, it should be stressed that a symbol sequence, being a finite
object, has zero information content, zero information capacity, and zero
redundancy. To discuss these issues with respect to genetic entities, one must
replace a finite sequence with a frequency dictionary of short strings. This
transformation replaces the finite sequence with the ensemble of (infinite)
sequences that yield the same frequency distribution of short strings as the
dictionary determined over the finite sequence. Further, we shall not mention
this difference, provided that no misunderstanding occurs; otherwise, special
remarks will be made.

Physically, a DNA sequence is a polymer molecule, which can be considered as
a symbol sequence from the four-letter alphabet $\aleph = \{A, C, G, T\}$,
where A refers to adenine, C refers to cytosine, G refers to guanine, and
T refers to thymine. Let $N$ be the length of the sequence, i.e. the number of
symbols in it. Further, we shall consider continuous sequences only; a
consideration of unbound sequences is possible, but it brings technical
problems rather than new understanding.

Any continuous string of length $q$, $1 \le q \le N$, observed within a
sequence makes a word $\omega = \nu_1\nu_2\nu_3 \ldots \nu_{q-1}\nu_q$ (of
length $q$); here $\nu_j \in \{A, C, G, T\}$. The set of all the words (of the
given length $q$) observed in the sequence makes the support of the sequence
(or $q$-support, if an indication of the length is necessary). Providing each
element of the support (i.e., each word $\omega$) with the number $n_\omega$
of its copies, one gets the dictionary $\mathfrak{S}_q$ of the sequence (of
thickness $q$). The dictionary $\mathfrak{S}_q$ is a finite object as well.
Replacing the number of copies $n_\omega$ with the frequency

$$f_\omega = \frac{n_\omega}{N},$$

one gets the frequency dictionary $W_q$ (of thickness $q$).

Such a definition of frequency requires closing the sequence into a ring; the
motivation behind this transformation is simple. Any dictionary of thickness
$q$ can easily be transformed into the thinner dictionary of thickness
$q - 1$. To get the dictionary $W_{q-1}$, one must sum up the frequencies of
the words differing in the first, or in the last, symbol. Carried out over a
dictionary $\mathfrak{S}_q$ of an unclosed sequence, these two summations
would yield two different results; the difference results from finite
sampling. The starting $q - 1$ symbols are not accounted for if the summation
is carried out over the first symbol; reciprocally, the ending $q - 1$ symbols
are lost if the summation is carried out over the last symbol. Closing the
sequence into a ring eliminates the problem.
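The ring closure described above can be illustrated with a minimal Python sketch. The function names (`frequency_dictionary`, `thinner`) and the toy sequence are hypothetical and introduced only for illustration; the sketch simply counts overlapping words on a ring and checks that summing over the first or over the last symbol gives the same thinner dictionary.

```python
from collections import Counter


def frequency_dictionary(seq: str, q: int) -> dict[str, float]:
    """Frequency dictionary W_q of `seq` closed into a ring (1 <= q <= len(seq))."""
    n = len(seq)
    ring = seq + seq[:q - 1]  # append the first q-1 symbols, so exactly n words are read
    counts = Counter(ring[i:i + q] for i in range(n))
    return {word: c / n for word, c in counts.items()}


def thinner(w_q: dict[str, float], drop: str = "last") -> dict[str, float]:
    """Downward step W_q -> W_{q-1}: sum frequencies over the last (or first) symbol."""
    out: Counter = Counter()
    for word, f in w_q.items():
        out[word[:-1] if drop == "last" else word[1:]] += f
    return dict(out)


seq = "ACGTTGCAACGT"  # hypothetical toy sequence, not taken from any genome
w3 = frequency_dictionary(seq, 3)
w2 = frequency_dictionary(seq, 2)
by_last = thinner(w3, "last")
by_first = thinner(w3, "first")

# On a ring both summations reproduce W_2 (up to floating-point rounding).
same = all(
    abs(by_last[w] - w2[w]) < 1e-12 and abs(by_first[w] - w2[w]) < 1e-12
    for w in w2
)
print(same)
```

Without the wrap-around in `frequency_dictionary`, the two summations would in general disagree on the words touching the ends of the sequence, which is the finite-sampling effect mentioned in the text.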

Consider a symbol sequence. Compose, then, a chain of dictionaries $W_j$ of
increasing thickness $j$:

$$W_1 \leftrightarrow W_2 \leftrightarrow \ldots \leftrightarrow W_q \leftrightarrow W_{q+1} \leftrightarrow \ldots \leftrightarrow W_N. \tag{1}$$

The study of the statistical properties of symbol sequences means an
investigation of the relations between the dictionaries within the chain.
A downward transformation, i.e., the transition from $W_j$ to $W_{j-1}$, is
simple and unique. On the contrary, the upward transformation is, in general,
ambiguous: a word $\omega$ may have several continuations (not more than 4, in
the case of nucleotide sequences). Such ambiguity results in a positive
information capacity of the relevant frequency dictionary.



II. RECONSTRUCTED FREQUENCY
DICTIONARY

The ambiguity of the transformation of a thinner dictionary into a thicker
one raises the question of the reconstruction of the dictionary. Indeed, while
the downward transformation $W_j \to W_{j-1}$ is unique, the upward
transformation, in general, generates several dictionaries. A transformation
of $W_q$ into $\widetilde W_{q+1}$ consists in combining the words from the
frequency dictionary $W_q$ so that the dictionary bearing the combined longer
words yields the original frequency dictionary. In other words, the
frequencies $\tilde f_{\nu_1\nu_2\nu_3\ldots\nu_{q-1}\nu_q\nu_{q+1}}$ of each
combined set of longer words must meet the constraint

$$\sum_{\nu} \tilde f_{\nu_1\nu_2\nu_3\ldots\nu_{q-1}\nu_q\nu} = \sum_{\nu} \tilde f_{\nu\nu_1\nu_2\nu_3\ldots\nu_{q-1}\nu_q} = f_{\nu_1\nu_2\nu_3\ldots\nu_{q-1}\nu_q}, \tag{2}$$

where $f_{\nu_1\nu_2\nu_3\ldots\nu_{q-1}\nu_q}$ is the frequency of the word
$\omega = \nu_1\nu_2\nu_3\ldots\nu_{q-1}\nu_q$ from the given frequency
dictionary. The linear constraints (2) eliminate a part of the possible
combinations of words, but the abundance of the combinations is still great
enough.

Since a set of dictionaries $\{W'_{q+1}, W''_{q+1}, W'''_{q+1}, \ldots\}$
still meets the constraint (2), one has to single out the one that is expected
to be the reconstructed entity. Such a reconstructed dictionary is identified
through the maximum of the entropy

$$S[\widetilde W_{q+1}] = -\sum_{\omega^*} \tilde f_{\omega^*} \ln \tilde f_{\omega^*} \tag{3}$$

of a dictionary, where $\omega^* = \nu_1\nu_2\nu_3\ldots\nu_{q-1}\nu_q\nu_{q+1}$
is a word meeting the linear constraint (2). The dictionary
$\widetilde W_{q+1}$ meeting the maximum principle (3) always exists, since
the set of the dictionaries which could be combined from the given one is
finite.

The frequencies of the words $\omega \in \widetilde W_{q+1}$ can be figured
out explicitly by the Lagrange multiplier method. The frequency of a word
$\omega$ of the reconstructed dictionary is determined by the expression

$$\tilde f_{\nu_1\nu_2\nu_3\ldots\nu_{q-1}\nu_q\nu_{q+1}} = \frac{f_{\nu_1\nu_2\nu_3\ldots\nu_{q-1}\nu_q} \times f_{\nu_2\nu_3\ldots\nu_{q-1}\nu_q\nu_{q+1}}}{f_{\nu_2\nu_3\ldots\nu_{q-1}\nu_q}}. \tag{4}$$

Expression (4) coincides exactly with Kirkwood's approximation [22]; the
absence of the interaction via the third "particle" makes expression (4) here
an exact solution of the problem.

Actually, the maximum entropy principle (3) allows one to reconstruct the
dictionary $\widetilde W_{q+l}$ for any $l > 1$. Here we provide the final
formula

$$\tilde f_{\nu_1\nu_2\nu_3\ldots\nu_{q+l-1}\nu_{q+l}} = \frac{f_{\nu_1\nu_2\nu_3\ldots\nu_{q-1}\nu_q} \times f_{\nu_2\nu_3\nu_4\ldots\nu_q\nu_{q+1}} \times \ldots \times f_{\nu_l\nu_{l+1}\nu_{l+2}\ldots\nu_{q+l-2}\nu_{q+l-1}} \times f_{\nu_{l+1}\nu_{l+2}\nu_{l+3}\ldots\nu_{q+l-1}\nu_{q+l}}}{f_{\nu_2\nu_3\ldots\nu_{q-1}\nu_q} \times f_{\nu_3\nu_4\nu_5\ldots\nu_q\nu_{q+1}} \times \ldots \times f_{\nu_{l+1}\nu_{l+2}\nu_{l+3}\ldots\nu_{q+l-2}\nu_{q+l-1}}}; \tag{5}$$

see the cited references for details.

Reconstruction of a thicker dictionary by the maximum entropy principle
yields the dictionary $\widetilde W_{q+1}$ (or $\widetilde W_{q+l}$,
respectively) that bears no outer, additional information. It contains the
words of length $q + 1$ (of length $q + l$, respectively) that are the most
probable continuations of the words of length $q$. The reconstructed
dictionary $\widetilde W_{q+1}$ bears all the words that occur in the real
dictionary $W_{q+1}$ and, maybe, some other ones. For any $q$, $q \ge 1$,

$$S[\widetilde W_{q+1}] \ge S[W_{q+1}].$$

Quite often, expression (4) is considered to be evidence of the Markovian
property of the original sequence [23], while that is not true. Expression (4)
is derived with no regard to the structure of the sequence. Indeed, this
expression coincides with the formula for a Markov process of the $q$-th
order. The coincidence is not accidental; it means that a Markov model of a
sequence realizes the hypothesis of the most probable continuation of a
string. We shall discuss this issue further (see Section IV).
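As an illustration of the reconstruction rule (4), the following Python sketch extends a dictionary of thickness $q$ by one symbol. It is a sketch under stated assumptions rather than the authors' implementation: the helper names and the toy sequence are hypothetical, and the dictionaries are computed over a sequence closed into a ring.

```python
from collections import Counter


def frequency_dictionary(seq, q):
    """Frequency dictionary W_q of `seq` closed into a ring."""
    n = len(seq)
    ring = seq + seq[:q - 1]
    counts = Counter(ring[i:i + q] for i in range(n))
    return {w: c / n for w, c in counts.items()}


def reconstruct_next(w_q, w_q_minus_1, alphabet="ACGT"):
    """Maximum-entropy extension of W_q by one symbol, following formula (4)."""
    out = {}
    for left, f_left in w_q.items():
        overlap = left[1:]                # v2 ... vq
        for symbol in alphabet:
            right = overlap + symbol      # v2 ... vq v(q+1)
            f_right = w_q.get(right, 0.0)
            if f_right > 0.0:
                out[left + symbol] = f_left * f_right / w_q_minus_1[overlap]
    return out


seq = "ACGTTGCAACGTAGGCTA"  # hypothetical toy sequence
w1, w2, w3 = (frequency_dictionary(seq, q) for q in (1, 2, 3))
w3_tilde = reconstruct_next(w2, w1)

# Every real word of length 3 is present in the reconstructed dictionary.
print(set(w3) <= set(w3_tilde))
```

The reconstructed frequencies sum to one and, summed over the last symbol, return the dictionary they were built from, which is the linear constraint (2).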



III. INFORMATION CAPACITY 

Information capacity is a measure of the deviation of the reconstructed
dictionary from the real one. The deviation can be measured in various ways.
The approach based on the so-called "quality of reconstruction" of a
dictionary is discussed elsewhere. One may use the regular Euclidean distance
to determine the difference between $W_{q+1}$ and $\widetilde W_{q+1}$. Here
we explore a more sensitive and more efficient method to detect the
difference between the entities, based on the calculation of mutual entropy.

TABLE I: Information capacity (10) determined for three chromosomes of the Schizosaccharomyces pombe complete genome and
for eleven chromosomes of Encephalitozoon cuniculi. q is the dictionary thickness.

      S.pombe                    E.cuniculi
 q    chr.I    chr.II   chr.III  I        II       III      IV       V        VI       VII      VIII     IX       X        XI
 2    0.00757  0.00743  0.00793  0.02042  0.01953  0.01846  0.02016  0.01968  0.02100  0.02102  0.02057  0.01990  0.02045  0.01953
 3    0.00202  0.00200  0.00211  0.01022  0.01027  0.01018  0.00945  0.01067  0.01088  0.00992  0.00982  0.00974  0.01023  0.00989
 4    0.00204  0.00205  0.00217  0.00565  0.00484  0.00466  0.00448  0.00417  0.00472  0.00469  0.00472  0.00429  0.00445  0.00469
 5    0.00079  0.00080  0.00102  0.00503  0.00385  0.00425  0.00401  0.00398  0.00406  0.00393  0.00377  0.00349  0.00386  0.00407
 6    0.00092  0.00108  0.00121  0.00950  0.00874  0.00831  0.00772  0.00843  0.00807  0.00745  0.00771  0.00669  0.00699  0.00694
 7    0.00231  0.00219  0.00201  0.02737  0.02726  0.02826  0.02455  0.02626  0.02523  0.02466  0.02305  0.02229  0.02114  0.02128
 8    0.00428  0.00387  0.00346  0.10920  0.10910  0.11163  0.09981  0.10269  0.10014  0.09642  0.09141  0.08825  0.08316  0.08238
 9    0.00955  0.00758  0.00699  0.31844  0.32478  0.33002  0.30823  0.30911  0.30287  0.29743  0.29264  0.28492  0.27174  0.27101
10    0.03413  0.02986  0.07854  0.43406  0.43628  0.43629  0.44194  0.43809  0.43332  0.43741  0.43859  0.44005  0.43754  0.43805
11    0.18561  0.14578  0.37254  0.27997  0.28007  0.27795  0.29237  0.28930  0.29570  0.29900  0.30620  0.31277  0.32115  0.32204
12    0.41966  0.38547  0.39866  0.11074  0.10991  0.10736  0.11744  0.11686  0.12058  0.12409  0.12828  0.13220  0.13832  0.13953
13    0.47562  0.41855  0.32564  0.03657  0.03448  0.03309  0.03791  0.03857  0.04054  0.04062  0.04082  0.04183  0.04613  0.04571
14    0.42864  0.39654  0.12658  0.01142  0.01000  0.00975  0.01142  0.01118  0.01200  0.01236  0.01204  0.01244  0.01384  0.01350
15    0.17547  0.15487  0.06246  0.00380  0.00313  0.00269  0.00338  0.00300  0.00376  0.00364  0.00356  0.00344  0.00404  0.00383
16    0.09254  0.08754  0.02457  0.00122  0.00109  0.00090  0.00108  0.00101  0.00097  0.00091  0.00122  0.00109  0.00112  0.00126
17    0.04369  0.03965  0.01288  0.00045  0.00030  0.00019  0.00032  0.00029  0.00037  0.00035  0.00027  0.00036  0.00043  0.00042
18    0.01485  0.00987  0.00712  0.00023  0.00008  0.00011  0.00008  0.00010  0.00020  0.00012  0.00014  0.00012  0.00013  0.00014
19    0.00204  0.00175  0.00121  0.00015  0.00010  0.00005  0.00008  0.00004  0.00005  0.00005  0.00006  0.00003  0.00003  0.00007
20    0.00021  0.00018  0.00013  0.00020  0.00005  0.00004  0.00005  0.00001  0.00001  0.00001  0.00005  0.00001  0.00004  0.00002

The mutual entropy of a distribution $\varphi$ with respect to a distribution
$\psi^*$ is defined as

$$\bar S[\varphi \,|\, \psi^*] = \sum_{\sigma} \varphi \cdot \ln\frac{\varphi}{\psi^*}, \tag{6}$$

where $\sigma$ is the space of definition of the distributions $\varphi$ and
$\psi^*$ [24]. Here the distribution $\psi^*$ is the equilibrium one. We shall
define the information capacity in a similar way: the real frequency
dictionary $W_q$ should be considered to be the distribution, while the
reconstructed dictionary $\widetilde W_q$ should be considered the
"equilibrium" one. Such a definition holds true, since the $q$-support of the
reconstructed dictionary always contains the $q$-support of the real one.

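To determine the information capacity of a frequency dictionary $W_q$, one
needs to develop the relevant reconstructed dictionary $\widetilde W_q$ (of
the same thickness $q$). Formula (4) is changed for

$$\tilde f_{\nu_1\nu_2\nu_3\ldots\nu_{q-1}\nu_q} = \frac{f_{\nu_1\nu_2\nu_3\ldots\nu_{q-2}\nu_{q-1}} \times f_{\nu_2\nu_3\ldots\nu_{q-1}\nu_q}}{f_{\nu_2\nu_3\ldots\nu_{q-2}\nu_{q-1}}}, \tag{7a}$$

with

$$\tilde f_{\nu_1\nu_2} = f_{\nu_1} \times f_{\nu_2} \tag{7b}$$

for the case of $q = 2$. The formulae for the reconstruction of a dictionary
of thickness $q$ over a dictionary of thickness $s$, $s < q - 1$, could be
provided as well; further we shall keep within the case of the reconstruction
of $\widetilde W_q$ over $W_{q-1}$. Expression (6) looks like

$$\bar S[W_q \,|\, \widetilde W_q] = \sum_{\omega} f_\omega \cdot \ln\frac{f_\omega}{\tilde f_\omega} \tag{8}$$

for the case of frequency dictionaries. Substituting (7) into (8), one gets

$$\bar S[W_q \,|\, \widetilde W_q] = \sum_{\omega} f_\omega \cdot \ln\frac{f_\omega \times f_{\omega''}}{f_{\omega_L} \times f_{\omega_R}}, \tag{9a}$$

$$\bar S[W_2 \,|\, \widetilde W_2] = \sum_{\nu_1\nu_2} f_{\nu_1\nu_2} \cdot \ln\frac{f_{\nu_1\nu_2}}{f_{\nu_1} \times f_{\nu_2}}; \tag{9b}$$

here $\omega_L = \nu_1\nu_2\ldots\nu_{q-2}\nu_{q-1}$,
$\omega_R = \nu_2\nu_3\ldots\nu_{q-1}\nu_q$ and
$\omega'' = \omega_L \cap \omega_R = \nu_2\nu_3\ldots\nu_{q-2}\nu_{q-1}$.
Expanding the logarithm of the ratio in (9) into a sum of four terms and
summing up over the "extra" indices, one gets

$$\bar S_q = 2S_{q-1} - S_q - S_{q-2} \quad\text{and}\quad \bar S_2 = 2S_1 - S_2, \tag{10}$$

where $S_j = -\sum_{\omega} f_\omega \ln f_\omega$ is the entropy of the real
dictionary $W_j$. The formulae (10) are changed for

$$\bar S_q = (q - s + 1)S_s - S_q - (q - s)S_{s-1} \quad\text{and}\quad \bar S_q = qS_1 - S_q$$

for the case of (5).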
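The equivalence between the definition (8) and the entropy form (10) can be checked numerically. The Python sketch below is only an illustration: the function names and the toy sequence are hypothetical, the sequence is closed into a ring, and the empty word is given frequency 1 (so that $S_0 = 0$) to cover the case $q = 2$.

```python
import math
from collections import Counter


def frequency_dictionary(seq, q):
    """Frequency dictionary W_q of `seq` closed into a ring; W_0 holds the empty word."""
    n = len(seq)
    if q == 0:
        return {"": 1.0}
    ring = seq + seq[:q - 1]
    counts = Counter(ring[i:i + q] for i in range(n))
    return {w: c / n for w, c in counts.items()}


def entropy(w):
    """Entropy S = -sum f ln f of a frequency dictionary."""
    return -sum(f * math.log(f) for f in w.values())


def capacity_direct(seq, q):
    """Formula (8) with the reconstructed frequencies (7a)/(7b); requires q >= 2."""
    w_q = frequency_dictionary(seq, q)
    w_l = frequency_dictionary(seq, q - 1)
    w_m = frequency_dictionary(seq, q - 2)
    total = 0.0
    for word, f in w_q.items():
        expected = w_l[word[:-1]] * w_l[word[1:]] / w_m[word[1:-1]]
        total += f * math.log(f / expected)
    return total


def capacity_from_entropies(seq, q):
    """Formula (10): 2 S_{q-1} - S_q - S_{q-2}, with S_0 = 0."""
    def s(k):
        return entropy(frequency_dictionary(seq, k))
    return 2 * s(q - 1) - s(q) - s(q - 2)


seq = "ACGTTGCAACGTAGGCTATTGACC"  # hypothetical toy sequence
for q in range(2, 7):
    print(q, capacity_direct(seq, q), capacity_from_entropies(seq, q))
```

The two columns printed for each $q$ agree up to rounding, and the entropy form needs only three dictionaries per value of $q$, which is convenient for long sequences.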



A. Some properties of information capacity 

The information capacity defined according to (10) exhibits some
peculiarities making the capacity a powerful tool for the study of symbol
sequences. Let us consider an estimation for the maximal value of the
information capacity (10). It is evident that the maximal level of the
information capacity would be observed for the case where the reconstructed
dictionary $\widetilde W_q$ has the maximal absolute entropy, i.e.
$\tilde f_{\omega_i} = \tilde f_{\omega_j}$, $\forall (i, j)$, while the real
frequency dictionary of the same thickness is as far from equilibrium as
possible.

TABLE II: Information capacity (10) of nineteen complete genomes of archaebacteria. N is the length of a genome.

                                      Dictionary thickness
Species                    N          q = 2     q = 3     q = 4     q = 5     q = 6     q = 7     q = 8     q = 9
A.pernix K1                1669695    0.015290  0.007944  0.012311  0.006464  0.005223  0.008679  0.015346  0.053446
A.fulgidus DSM 4304        2178400    0.018170  0.007606  0.011233  0.008505  0.006617  0.008051  0.013013  0.041986
Halobacterium sp.          2014239    0.028372  0.007114  0.013193  0.008810  0.013955  0.013883  0.016140  0.044798
M.thermoautotrophicum ΔH   1751377    0.021919  0.012330  0.009047  0.006213  0.007599  0.009382  0.016337  0.052717
M.jannaschii DSM 2661      1664970    0.020938  0.007459  0.013321  0.004747  0.006425  0.009844  0.016392  0.046324
M.maripaludis              1661137    0.015925  0.006152  0.006783  0.005068  0.005482  0.008590  0.017090  0.054143
M.kandleri AV19            1694969    0.016257  0.008819  0.006385  0.003698  0.003846  0.007801  0.014203  0.052200
M.acetivorans              5751492    0.014008  0.010528  0.003869  0.001678  0.001791  0.004272  0.005894  0.019365
M.mazei                    4096345    0.015407  0.013163  0.004834  0.001999  0.002125  0.004579  0.006651  0.022871
N.equitans                 490885     0.018216  0.012787  0.008837  0.004529  0.007170  0.016564  0.044161  0.122603
P.torridus DSM 9790        1545895    0.015631  0.014016  0.010894  0.008151  0.006663  0.008954  0.017796  0.056461
P.aerophilum               2222430    0.007622  0.014392  0.010685  0.010080  0.004820  0.006798  0.012019  0.040344
P.abyssi                   1765118    0.015921  0.008895  0.005493  0.005236  0.004513  0.006985  0.014085  0.049796
P.furiosus DSM 3638        1908256    0.021224  0.005685  0.003697  0.003117  0.003536  0.006347  0.013273  0.046818
P.horikoshii               1738505    0.019283  0.007373  0.004917  0.003867  0.003366  0.006636  0.013916  0.049387
S.solfataricus             2992245    0.007988  0.003279  0.004015  0.002656  0.002072  0.005709  0.012057  0.039543
S.tokodaii                 2694756    0.009540  0.002887  0.004678  0.002896  0.002598  0.005675  0.010279  0.034593
Th.acidophilum             1564906    0.010168  0.012824  0.005046  0.003288  0.003858  0.006748  0.014218  0.055396
Th.volcanium               1584804    0.005425  0.006657  0.002924  0.001421  0.002066  0.005435  0.013266  0.054005

Consider an infinitely long periodic sequence from the two-letter alphabet
$\{0, 1\}$:

$\ldots 01010101 \ldots$ ,

with the dictionary $W_1$ having two words (these are the symbols) with equal
frequencies. The dictionary $W_2$ has only two words, 01 and 10, with equal
frequencies, so that $f_{01} = f_{10} = 1/2$, while $f_{11} = f_{00} = 0$.
Formula (9b) (see also (10)) yields $\bar S_2 = \ln 2$. Consider, then, the
infinite periodic sequence

$(1100)_n$ .

Again, this sequence exhibits an equilibrium frequency dictionary $W_2$ and a
quasi-equilibrium dictionary $W_3$; formula (9a) yields the same value
$\bar S_3 = \ln 2$. A similar sequence could be figured out for any
equilibrium dictionary $W_q$; thus, the maximal value of (10) is equal to
$\ln 2$ for any $q$.

The sequence from the four-letter alphabet $\aleph = \{A, C, G, T\}$ with the
equilibrium dictionary $W_1$ and the quasi-equilibrium dictionary $W_2$ is
evident:

$\ldots ATGCATGCATGC \ldots$ .

Formula (9b) yields $\bar S_2 = 2\ln 2$. The infinite periodic sequence
$(AACCGGTTGAGCATCT)_n$ provides the same pattern of dictionaries, with
$\bar S_3 = 2\ln 2$. Going this way, one obtains $\bar S_q = 2\ln 2$ for any
$q$.
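The four periodic examples above can be verified with a short Python sketch based on formula (10). The helper names are hypothetical; a single period closed into a ring is used as a stand-in for the infinite periodic sequence, since both have the same frequency dictionaries.

```python
import math
from collections import Counter


def entropy_of_period(period, q):
    """Entropy S_q of the dictionary W_q of the infinite sequence (period)_n; S_0 = 0."""
    if q == 0:
        return 0.0
    n = len(period)
    ring = period * (q // n + 2)  # long enough to read n words of length q
    counts = Counter(ring[i:i + q] for i in range(n))
    return -sum(c / n * math.log(c / n) for c in counts.values())


def capacity(period, q):
    """Information capacity by formula (10): 2 S_{q-1} - S_q - S_{q-2}."""
    return (2 * entropy_of_period(period, q - 1)
            - entropy_of_period(period, q)
            - entropy_of_period(period, q - 2))


ln2 = math.log(2)
print(capacity("01", 2) / ln2)                # capacity of ...0101... at q = 2, in units of ln 2
print(capacity("0011", 3) / ln2)              # capacity of (1100)_n at q = 3
print(capacity("ATGC", 2) / ln2)              # capacity of ...ATGC... at q = 2
print(capacity("AACCGGTTGAGCATCT", 3) / ln2)  # capacity of the 16-letter period at q = 3
```

The first two ratios come out as 1 and the last two as 2 (up to rounding), in agreement with the values $\ln 2$ and $2\ln 2$ quoted in the text.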

If a sequence is arranged from an alphabet $\aleph$ of cardinality $M$, then
the upper level of the information capacity (10) for such a sequence is equal
to

$$\bar S_q = \frac{M \cdot \ln 2}{2}.$$

The sense of this relation is clear: it is the indeterminacy of the choice of
a word of length $q$ from the subset of equally distributed ones, given that
only a half of all possible words have positive frequency.



B. Information capacity and redundancy of 
sequences 

Intuitively, redundancy is a measure of the excess of information content
observed within a sequence. Traditionally, redundancy is defined through
two-symbol correlations, or through a two-symbol entropy calculation, in
comparison to the single-symbol distribution. A study of the information
capacity of frequency dictionaries provides a researcher with a more advanced
definition of sequence redundancy.

Consider again the chain of dictionaries (1). A frequency dictionary $W_q$ is
redundant if it guarantees an unambiguous reconstruction of the thicker
dictionary. The critical thickness $d^*$ of the redundant dictionary can be
determined constructively; $d^*$ is the thickness yielding uniqueness of any
word in it. Some biological issues concerning the determination of $d^*$ for
various segments of genes are presented and discussed elsewhere.

TABLE III: Information capacity (10) of several eukaryotic genomes.

                                Dictionary thickness
Entry      Chr.   N         q = 2     q = 3     q = 4     q = 5     q = 6     q = 7     q = 8     q = 9

Encephalitozoon cuniculi
...        I      ...       ...       ...       ...       ...       ...       ...       ...       ...
CNS07EG9   II     197426    0.019526  0.010266  0.004844  0.003853  0.008743  0.027257  0.109101  0.324783
CNS07EGA   III    194439    0.018463  0.010184  0.004661  0.004254  0.008309  0.028256  0.111627  0.330016
CNS07EGB   IV     218329    0.020161  0.009449  0.004475  0.004010  0.007722  0.024553  0.099807  0.308234
CNS07EGC   V      251002    0.019684  0.010673  0.004171  0.003981  0.008430  0.026259  0.102691  0.309111
AL590446   VI     211018    0.020997  0.010878  0.004720  0.004064  0.008074  0.025227  0.100142  0.302868
AL590447   VII    220294    0.021018  0.009918  0.004688  0.003929  0.007445  0.024660  0.096422  0.297431
AL590448   VIII   226576    0.020570  0.009823  0.004718  0.003770  0.007709  0.023050  0.091410  0.292643
AL590445   IX     238147    0.019905  0.009738  0.004288  0.003486  0.006685  0.022291  0.088252  0.284922
AL590449   X      262797    0.020453  0.010226  0.004451  0.003865  0.006990  0.021142  0.083162  0.271742
AL590450   XI     267509    0.019526  0.009886  0.004695  0.004067  0.006937  0.021279  0.082380  0.271006

Eremothecium gossypii
AE016814   I      691920    0.004391  0.002940  0.005624  0.001475  0.002799  0.010756  0.029318  0.127641
AE016815   II     867694    0.004230  0.003054  0.004974  0.001441  0.002521  0.009130  0.023515  0.100758
AE016816   III    907057    0.004661  0.002835  0.006406  0.001625  0.002485  0.010476  0.022534  0.096182
AE016817   IV     1466891   0.004174  0.002903  0.005306  0.001186  0.001789  0.007174  0.014014  0.056525
AE016818   V      1519138   0.004194  0.002792  0.005086  0.001239  0.001775  0.006852  0.013755  0.054365
AE016819   VI     1812713   0.004247  0.002820  0.005535  0.001184  0.001736  0.006983  0.011623  0.044951
AE016820   VII    1476021   0.004276  0.002914  0.005688  0.001366  0.001792  0.007751  0.014080  0.056334

Plasmodium falciparum
AL844501   1      643292    0.004537  0.021562  0.012614  0.016715  0.014143  0.026339  0.043922  0.094463
AE001362   2      947102    0.005162  0.023034  0.010135  0.011752  0.008768  0.017372  0.030138  0.069398
AL844502   3      1060087   0.005085  0.021935  0.010529  0.013345  0.009207  0.016876  0.027924  0.065252
AL844503   4      1204112   0.006206  0.022380  0.011045  0.013383  0.010601  0.019553  0.028004  0.065541
AL844504   5      1343552   0.005692  0.024463  0.009726  0.011810  0.007075  0.013638  0.021777  0.051344
AL844505   6      1418244   0.005632  0.022289  0.009903  0.012013  0.008739  0.016125  0.024536  0.054656
AL844506   7      1351552   0.005754  0.021366  0.010259  0.010999  0.007295  0.014541  0.022528  0.055176
AL844507   8      1325595   0.006079  0.024584  0.010025  0.012517  0.007839  0.014781  0.022590  0.054937
AL844508   9      1541723   0.005251  0.024587  0.009218  0.012546  0.008096  0.014320  0.020518  0.048233
AE014185   10     1694445   0.004614  0.023440  0.011864  0.012760  0.009820  0.017158  0.023704  0.049696
AE014186   11     2035250   0.005433  0.024747  0.009713  0.011386  0.006347  0.012405  0.016579  0.038632
AE014188   12     2271916   0.005712  0.023763  0.009716  0.011348  0.006377  0.011493  0.014971  0.036721
AL844509   13     2732359   0.005614  0.022843  0.008908  0.010828  0.006489  0.011203  0.013794  0.031419
AE014187   14     3291006   0.005750  0.024419  0.008603  0.010351  0.005407  0.009717  0.011047  0.025341

Trypanosoma brucei
AL929608   1      1056003   0.008134  0.011493  0.011055  0.008356  0.013433  0.026265  0.043841  0.119718
AE017150   2      1193931   0.008328  0.009158  0.008691  0.004800  0.005640  0.013662  0.035133  0.116308

Candida glabrata strain CBS138
CR380947   A      485192    0.006112  0.002636  0.003803  0.002073  0.003990  0.013642  0.045155  0.174482
CR380948   B      502101    0.006527  0.002561  0.003240  0.001568  0.003522  0.012129  0.042753  0.163491
CR380949   C      558804    0.006026  0.002585  0.003555  0.001937  0.003717  0.012608  0.041452  0.157846
CR380950   D      651701    0.006884  0.002475  0.002945  0.001584  0.002988  0.009860  0.032530  0.128877
CR380951   E      687501    0.006841  0.002654  0.003418  0.001701  0.003165  0.010837  0.033211  0.126183
CR380952   F      927101    0.006915  0.002329  0.002912  0.001432  0.002149  0.007396  0.022390  0.090687
CR380953   G      992211    0.007020  0.002484  0.003094  0.001407  0.002323  0.007444  0.021144  0.085899
CR380954   H      1050361   0.006912  0.002489  0.003002  0.001256  0.002165  0.006884  0.020244  0.081305
CR380955   I      1089401   0.006330  0.002423  0.003010  0.001320  0.002296  0.007551  0.021251  0.081453
CR380956   J      1192501   0.006948  0.002566  0.003183  0.001570  0.002191  0.007090  0.018903  0.073283
CR380957   K      1302002   0.006974  0.002343  0.002822  0.001261  0.001924  0.005637  0.016214  0.064765
CR380958   L      1440588   0.006895  0.002597  0.002882  0.001305  0.001999  0.006217  0.015926  0.060181
CR380959   M      1400893   0.006922  0.002479  0.002850  0.001349  0.001845  0.005611  0.014860  0.060465

Yarrowia lipolytica strain CLIB99
CR382127   A      2303261   0.007971  0.004445  0.004349  0.002684  0.003795  0.006816  0.012190  0.040135
CR382128   B      3066374   0.007967  0.004337  0.004082  0.002549  0.003753  0.006128  0.009337  0.028461
CR382129   C      3272609   0.008314  0.004102  0.004062  0.002532  0.003487  0.005935  0.009274  0.028189
CR382130   D      3633272   0.008142  0.004317  0.004421  0.002759  0.003827  0.006058  0.008593  0.024927
CR382131   E      4224103   0.008102  0.004520  0.004068  0.002529  0.003677  0.005642  0.007702  0.021561
CR382132   F      4003362   0.007716  0.004372  0.004336  0.002662  0.003557  0.005920  0.007903  0.022481

The definition of redundancy through the length of the longest common repeat
is simple and transparent; meanwhile, it has a serious disadvantage. The
measure of redundancy tends to $N$ (here $N$ is the length of the entire
sequence) for rather simple and obviously redundant sequences, such as
periodic ones.

On the contrary, the calculation of information capacity is free of that
discrepancy. As soon as $\bar S_q = 0$ for some $q$, this thickness of the
dictionary should be considered as the redundancy measure. Surely, this
specific thickness $q$ depends both on the structure of a sequence (whatever
one understands by that) and on its length.
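The critical thickness $d^*$ discussed in this subsection can be found by a direct search. The Python sketch below is a naive illustration under stated assumptions: the function name is hypothetical, the sequence is closed into a ring, and $d^*$ is taken to be the smallest thickness at which every word of the dictionary occurs in a single copy. The quadratic cost makes it suitable for short toy sequences only.

```python
from collections import Counter


def critical_thickness(seq):
    """Smallest q at which every word of the ring dictionary occurs exactly once.

    Returns None if no thickness q <= len(seq) has this property,
    as happens for a sequence consisting of repeated periods.
    """
    n = len(seq)
    for q in range(1, n + 1):
        ring = seq + seq[:q - 1]
        counts = Counter(ring[i:i + q] for i in range(n))
        if all(c == 1 for c in counts.values()):
            return q
    return None


print(critical_thickness("ATGC"))  # every symbol is already unique
print(critical_thickness("AACC"))  # symbols repeat, pairs do not
print(critical_thickness("ATAT"))  # two copies of the period AT: no such thickness
```

The last call shows the disadvantage mentioned above: for a periodic sequence no finite thickness makes the words unique, while the capacity-based criterion $\bar S_q = 0$ is already met at a small $q$.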



C. Information capacity of some real genetic 
systems 

Here we present some results of the determination of the information capacity
(10) for real nucleotide sequences. We studied the complete genomes of
bacteria and eukaryotes; all the entities are deposited at EMBL-bank.

Table I shows the results of the information capacity calculation for the
Schizosaccharomyces pombe yeast complete genome and for the protozoan
Encephalitozoon cuniculi complete genome. It is evident that the pattern of
the information capacity (10) is bell-shaped. Such a pattern results from the
finiteness of the length of the original sequence. Indeed, as $q$ approaches
20 nucleotides, the abundance of the complete support of a dictionary becomes
equal to $4^{20}$; this number exceeds $10^{12}$, and such long genomes have
not been found yet. The huge number of possible words means that the greater
part of them is absent from a dictionary. This exponentially growing
abundance results in a degeneration of a thicker frequency dictionary: the
vast majority of the words occur in a single copy (see Section III B and
[31, 32, 33, 34]).
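The counting argument behind the degeneration of thick dictionaries can be made explicit with a few lines of Python. The function names are hypothetical; the genome length used in the example is the value of $N$ listed for M.acetivorans in Table II.

```python
def possible_words(q, alphabet_size=4):
    """Number of all possible words of length q over the alphabet."""
    return alphabet_size ** q


def max_support_fraction(n, q, alphabet_size=4):
    """Upper bound on the share of all possible q-words that a ring of length n can contain."""
    return min(1.0, n / possible_words(q, alphabet_size))


n = 5751492  # length of the M.acetivorans genome from Table II
print(possible_words(20))           # 4**20
print(possible_words(20) > 10**12)  # the estimate quoted in the text
for q in (8, 12, 16, 20):
    print(q, max_support_fraction(n, q))
```

A sequence of length $N$ closed into a ring contains exactly $N$ words of any thickness, so once $4^q$ exceeds $N$ the support necessarily misses most of the possible words.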

A study of the location and the value of the maximum of the information
capacity (10) makes sense for sequences of close length. Besides, the
location (i.e. the thickness $q$) of the maximum and its value are both
sensitive to the structure of a sequence. A degeneracy of a dictionary
results in a shift of the maximum of (10) to shorter words. The genomes are
rather diverse from that point of view.

The behaviour of the information capacity for $2 \le q \le 9$ is of the
greatest interest. It is evident that the information capacity (10) varies
non-monotonically. An excess of $\bar S_2$ over $\bar S_3$ is a well-known
fact with a rather clear biological explanation. The occurrence of two or
three minima in the information capacity pattern is a newly established fact.
It should be said that a multi-minima pattern is rather widespread among the
studied genomes, while it is not obligatory. The genome of E.cuniculi exhibits
a single minimum of the information capacity at $q = 5$, for all chromosomes.
Table II shows the information capacity for $2 \le q \le 9$ for nineteen
complete genomes of archaebacteria. Eleven entities have two minima of the
information capacity. This table presents a new phenomenon: three genomes
exhibit an inversion in the information capacity variation at
$2 \le q \le 4$. These are Pyrobaculum aerophilum (identifier AE009441),
Thermoplasma acidophilum (identifier AL139299) and Thermoplasma volcanium
(identifier BA000011).
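The notions of a minimum and of an inversion used above can be made operational with a small Python sketch. The function names are hypothetical, and an inversion is read here as the capacity at $q = 2$ being lower than at $q = 3$; the numbers are the rows of A.pernix K1 and P.aerophilum from Table II and of E.cuniculi chromosome II from Table III.

```python
def local_minima(capacity_by_q):
    """Thicknesses q at which the capacity is strictly below both neighbours."""
    qs = sorted(capacity_by_q)
    return [
        q for prev, q, nxt in zip(qs, qs[1:], qs[2:])
        if capacity_by_q[q] < capacity_by_q[prev] and capacity_by_q[q] < capacity_by_q[nxt]
    ]


def has_inversion(capacity_by_q):
    """True if the capacity at q = 2 is lower than at q = 3."""
    return capacity_by_q[2] < capacity_by_q[3]


def row(values):
    """Attach thicknesses q = 2..9 to a table row."""
    return dict(zip(range(2, 10), values))


a_pernix = row([0.015290, 0.007944, 0.012311, 0.006464, 0.005223, 0.008679, 0.015346, 0.053446])
p_aerophilum = row([0.007622, 0.014392, 0.010685, 0.010080, 0.004820, 0.006798, 0.012019, 0.040344])
e_cuniculi_ii = row([0.019526, 0.010266, 0.004844, 0.003853, 0.008743, 0.027257, 0.109101, 0.324783])

print(local_minima(a_pernix))       # two minima
print(local_minima(e_cuniculi_ii))  # a single minimum
print(has_inversion(p_aerophilum))  # inversion at small q
```

Only interior points of the range $2 \le q \le 9$ are tested, so a minimum located at the boundary of the range is not reported by this sketch.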

The inversion observed in the family of archaebacteria can also be observed
in other genomes of various taxonomy. Table III shows the information
capacity determined for $2 \le q \le 9$ over various eukaryotic genomes.
Finally, Table IV presents the most abundant data concerning the behaviour of
the information capacity (10) for over 150 complete genomes of eubacteria.
A few words should be said about the format of this table. Due to space
limitations, all nomenclature of bacteria is provided in a shortened form.
Some lines are identified with the same species name; this means that such
entities belong to different strains or different serovariants. Detailed
information concerning the taxonomy of a genome can be retrieved from
EMBL-bank by its identifier.



IV. DISCUSSION 

A researcher capitalizes a lot from the studies of the 
statistical properties of nucleotide sequences. Here we 
propose a novel approach towards the definition of the 
information capacity of a frequency dictionary of a se- 
quence. The key idea of the information capacity defi- 
nition is the comparison of real and expected frequency 
of considerably short strings occurred within a symbol 
sequence. A definition of an expected frequency is the 
basic problem in the studies of information properties of 
such entities. 

Basically, there are two approaches to identifying an
expected frequency. The former is to replace the sequence
under consideration with some surrogate entity with known
(or specially prepared) statistical properties, say, to
consider a realization of some random process [23, 42]. The
latter is to figure out the most expected continuation of a
string using only the information available in the
frequency dictionary. Replacing an original sequence
with a surrogate one, a researcher brings outer,
additional information into the study. Such an intrusion of
additional information may undermine the reliability of the
retrieved knowledge and conceal some fine properties of the
original sequence.
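
The second approach can be sketched as follows, assuming that the most expected frequency of a word of length q is rebuilt from the dictionary one symbol thinner as f(i1...iq-1) f(i2...iq) / f(i2...iq-1), and that the information capacity is the mutual entropy, i.e. the sum of f ln(f / f_expected) over the observed words. This form is a common maximum-entropy reconstruction and is an assumption here; the formulas stated earlier in the paper take precedence. Reading the sequence as a ring and all helper names are likewise illustrative.

```python
from collections import Counter
from math import log


def freq_dict(seq, q):
    """Frequencies of words of length q, reading seq as a ring (an assumption)."""
    ring = seq + seq[:q - 1]
    counts = Counter(ring[i:i + q] for i in range(len(seq)))
    return {w: c / len(seq) for w, c in counts.items()}


def expected_freq(word, w_prev, w_core):
    """Most expected frequency of `word` given the dictionary one symbol thinner."""
    # For words of length 2 the shared core is the empty word, with frequency 1.
    core = w_core[word[1:-1]] if len(word) > 2 else 1.0
    return w_prev[word[:-1]] * w_prev[word[1:]] / core


def information_capacity(seq, q):
    """Mutual entropy of the real dictionary W_q with respect to the reconstructed one."""
    w_q, w_prev = freq_dict(seq, q), freq_dict(seq, q - 1)
    w_core = freq_dict(seq, q - 2) if q > 2 else {}
    return sum(f * log(f / expected_freq(w, w_prev, w_core))
               for w, f in w_q.items())


periodic = "ACGT" * 10
print(information_capacity(periodic, 2))  # ln 4, about 1.386
print(information_capacity(periodic, 3))  # 0.0
```

For this strictly periodic sequence the capacity vanishes at q = 3, because every word of length 3 is fully determined by its first two symbols.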

Studying the statistical properties of symbol sequences,
researchers quite often restrict themselves to the
consideration of mono- and dinucleotide distributions
[...]. A breakthrough in that direction results from the
fundamental studies of Boltzmann's equation [...], which
were successfully carried over into the field of
bioinformatics. In this paper, we implemented a version of
the method of invariant manifolds to derive the formula for
the most expected continuation of a string. Since the
strings are discrete objects, the formula (10) becomes the
exact solution, in contrast to the typical physical
situation [22, ...].

Zero information capacity of a frequency dictionary $W_q$
means the exact and unambiguous extension of the given
dictionary into any thicker one $W_k$, k > q. This point
provides a researcher with a new tool to define the
redundancy of a frequency dictionary. Further, we shall
understand the redundancy of a dictionary as the redundancy
of the sequence itself. There is a simpler way to define
the redundancy; it is based on the determination of the
longest repeat within a sequence [...]. It was found that
the redundancy of introns exceeds that of exons, and that
splicing results in a decrease of the overall gene
redundancy. Meanwhile, it should be stressed that this
simple method of redundancy determination fails to handle
highly ordered (e.g. periodic) sequences. The definition of
redundancy through an information capacity calculation is
free from that discrepancy. It should be kept in mind that
a zero value of (10) does not automatically yield an exact
and unambiguous reconstruction of a thicker finite
dictionary $ q . A redundancy, then, is to be considered in
two interrelated but distinct ways: the former is the
measure defined through the d* perfect expansion up to the
entire symbol sequence, and the latter is a high level of
predictability of the continuation of each word.

TABLE IV: Information capacity (10) of several eukaryotic genomes.

Entry           N  Species                    q = 2     q = 3     q = 4     q = 5     q = 6     q = 7     q = 8     q = 9
CR543861  3598621  Acinetobacter sp.          0.013695  0.006754  0.003180  0.003494  0.003362  0.003889  0.007593  0.025350
AE007870  2841581  Ag.tumefaciens             0.024522  0.012524  0.007616  0.008113  0.004989  0.007159  0.013111  0.042823
AE008689  2074782  Ag.tumefaciens             0.024518  0.012519  0.007616  0.008108  0.004986  0.007162  0.013104  0.042800
AE007869  2841490  Ag.tumefaciens             0.025693  0.013720  0.008115  0.008678  0.005462  0.007398  0.011357  0.032478
AE008688  2075560  Ag.tumefaciens             0.025694  0.013721  0.008115  0.008678  0.005461  0.007397  0.011352  0.032470
AE000657  1551335  Aq.aeolicus VF5            0.025086  0.013949  0.010799  0.006867  0.005535  0.009461  0.016737  0.057694
AE017225  5227293  B.anthracis                0.005011  0.004804  0.006296  0.003584  0.002930  0.004548  0.006860  0.019529
AE016879  5228310  B.anthracis                0.005011  0.004803  0.006296  0.003586  0.002928  0.004543  0.006852  0.019510
AE017334  5228663  B.anthracis                0.005013  0.004802  0.006294  0.003584  0.002926  0.004542  0.006852  0.019509
AE017194  5224283  B.cereus                   0.005092  0.004713  0.006247  0.003507  0.002941  0.004795  0.007201  0.020537
AE016877  5411809  B.cereus                   0.004948  0.004500  0.006028  0.003295  0.002789  0.004601  0.007054  0.020032
BA000004  4202352  B.halodurans               0.011038  0.003900  0.005231  0.002782  0.001808  0.003647  0.007395  0.024234
AL009126  4214630  B.subtilis                 0.016737  0.009942  0.004373  0.001952  0.001894  0.003744  0.006976  0.023948
AE017355  5237682  B.thuringiensis            0.004957  0.004765  0.006262  0.003542  0.002932  0.004808  0.007532  0.021289
AE015928  6260361  B.thetaiotaomicron         0.004681  0.010665  0.006649  0.002838  0.002634  0.003228  0.004825  0.015111
BX897699  1931047  B.henselae                 0.015746  0.005150  0.004615  0.002532  0.002570  0.005103  0.013421  0.049652
BX897700  1581384  B.quintana                 0.016423  0.004543  0.004067  0.002225  0.002388  0.005020  0.014100  0.056008
BX842601  3782950  B.bacteriovorus            0.021078  0.010726  0.003837  0.004967  0.004054  0.005209  0.008686  0.024182
AE014295  2256646  B.longum                   0.013680  0.010854  0.005459  0.007544  0.005795  0.009393  0.013759  0.043138
BX470250  5339179  B.bronchiseptica           0.019051  0.018226  0.014946  0.009624  0.007019  0.014473  0.010739  0.020758
BX470249  4773551  B.parapertussis            0.019309  0.018475  0.015118  0.009716  0.007072  0.014730  0.011562  0.023433
BX470248  4086189  B.pertussis                0.018533  0.018289  0.014508  0.010287  0.010869  0.026770  0.044285  0.097603
AE000783   910724  B.burgdorferi              0.023227  0.002133  0.003246  0.002873  0.003887  0.009814  0.024221  0.076756
BA000040  9105828  B.japonicum                0.027282  0.009713  0.008155  0.006690  0.004230  0.007997  0.006765  0.013241
AE008917  2117144  B.melitensis, chr.I        0.028077  0.011210  0.006071  0.007949  0.007784  0.007177  0.013674  0.043426
AE008918  1177787  B.melitensis, chr.II       0.028054  0.010904  0.006750  0.008201  0.008306  0.008633  0.020434  0.070601
AE014291  2107793  B.suis 1330, chr.I         0.028299  0.011342  0.006123  0.008063  0.007878  0.007275  0.013840  0.043651
AE014292  1207381  B.suis, chr.II             0.027777  0.010778  0.006642  0.008118  0.008243  0.008567  0.019980  0.069020
BA000003   640681  B.aphidicola               0.007608  0.004984  0.007715  0.002464  0.004401  0.011496  0.033328  0.100963
AE016826   615980  B.aphidicola               0.004517  0.004223  0.005281  0.002105  0.004130  0.011172  0.034193  0.097610
AE013218   641454  B.aphidicola               0.010056  0.005607  0.007001  0.002654  0.004619  0.011923  0.032871  0.093878
AL111168  1641481  C.jejuni                   0.025925  0.007195  0.008380  0.006480  0.006136  0.009552  0.017044  0.049726
BX248583   705557  B.floridanus               0.004191  0.004090  0.004874  0.002726  0.004370  0.010925  0.030405  0.093783
AE005673  4016947  C.crescentus               0.015292  0.022869  0.012056  0.012406  0.011076  0.028279  0.070484  0.163516
AE002160  1072950  Ch.muridarum               0.015077  0.005375  0.002794  0.002095  0.002951  0.006255  0.019863  0.078997
AE001273  1042519  Ch.trachomatis             0.013610  0.005905  0.003162  0.002003  0.002976  0.006651  0.020558  0.082006
AE015925  1173390  Ch.caviae                  0.010397  0.005354  0.003291  0.003658  0.004232  0.006445  0.018200  0.072489
AE002161  1229853  Ch.pneumoniae              0.013529  0.004222  0.002565  0.003561  0.003454  0.006385  0.017356  0.069847
AE001363  1230230  Ch.pneumoniae              0.013530  0.004219  0.002566  0.003566  0.003454  0.006398  0.017363  0.069779
BA000008  1226565  Ch.pneumoniae              0.013539  0.004219  0.002549  0.003553  0.003438  0.006368  0.017319  0.069720
AE009440  1225935  Ch.pneumoniae              0.013506  0.004219  0.002566  0.003564  0.003453  0.006393  0.017342  0.069793
AE006470  2154946  C.tepidum                  0.020061  0.011061  0.008408  0.006773  0.008106  0.008369  0.012377  0.041511
AE016825  4751080  C.violaceum                0.019041  0.018825  0.014015  0.009803  0.007443  0.016733  0.011386  0.024193
AE001437  3940880  C.acetobutylicum           0.010606  0.005123  0.004222  0.001690  0.002404  0.005410  0.008253  0.025033
BA000016  3031430  C.perfringens              0.017457  0.004479  0.005310  0.002498  0.004307  0.008116  0.011921  0.030875
AE015927  2799251  C.tetani                   0.014410  0.006102  0.006276  0.002459  0.003594  0.006788  0.010479  0.031232
BX248353  2488635  C.diphtheriae              0.007448  0.003906  0.005199  0.004450  0.003300  0.005425  0.010729  0.035929
BA000035  3147090  C.efficiens                0.013561  0.010039  0.011470  0.009562  0.006170  0.009947  0.013196  0.034160
AX114121  3309400  C.glutamicum               0.011670  0.005214  0.004643  0.005185  0.003740  0.005143  0.009104  0.027874
BA000036  3309401  C.glutamicum               0.011670  0.005214  0.004643  0.005185  0.003740  0.005143  0.009104  0.027873
BX927147  3282708  C.glutamicum               0.011725  0.005243  0.004649  0.005196  0.003758  0.005142  0.009162  0.028107
AE016828  1995275  C.burnetii strain          0.020865  0.001531  0.004846  0.001590  0.002105  0.004843  0.012999  0.047250
AE000513  2648638  D.radiodurans, chr.I       0.010846  0.008048  0.013872  0.009601  0.016747  0.013908  0.015656  0.038380
AE001825   412348  D.radiodurans, chr.2       0.010821  0.007962  0.014203  0.009626  0.018188  0.022831  0.054850  0.149147
AE017285  3570858  D.vulgaris                 0.010306  0.013164  0.009729  0.010660  0.007107  0.007437  0.009646  0.026623
AE016830  3218031  E.faecalis                 0.013746  0.002634  0.005872  0.002905  0.003065  0.004865  0.008791  0.028356
BX950851  5231428  E.carotovora               0.011198  0.010238  0.006360  0.005196  0.003841  0.003902  0.006142  0.018491
AE014075  4639675  E.coli                     0.011808  0.012713  0.008453  0.004938  0.004337  0.003890  0.006233  0.018843
U00096    5498450  E.coli K-12                0.012761  0.012604  0.008689  0.005368  0.004935  0.004300  0.007042  0.021171
AE005174  5528970  E.coli                     0.011766  0.012788  0.008379  0.004734  0.004253  0.004040  0.006233  0.018996
BA000007  2174500  E.coli                     0.011825  0.012822  0.008422  0.004794  0.004350  0.004063  0.006187  0.018753

-               -  -                          -         -         0.008109  0.005166  0.005014  0.005425  0.011490  0.031813
AE005674  4607203  Sh.flexneri                0.011748  0.012673  0.008080  0.005184  0.005035  0.005501  0.011795  0.032345
AL591688  4599354  S.meliloti                 0.026681  0.010958  0.007164  0.006438  0.003565  0.007529  0.009891  0.027683
BX571856  2902619  St.aureus                  0.005913  0.002068  0.005134  0.002628  0.003085  0.005770  0.010129  0.032952
BX571857  2799802  St.aureus                  0.005964  0.002117  0.005290  0.002717  0.003144  0.005981  0.010413  0.033942

BA000017  2820462  St.aureus                  0.005854  0.002085  0.005151  0.002687  0.003078  0.005765  0.010071  0.032912
BA000033  2878529  St.aureus                  0.005982  0.002104  0.005273  0.002719  0.003128  0.005970  0.010400  0.033816
BA000018  2814816  St.aureus                  0.005914  0.002130  0.005207  0.002716  0.003127  0.005906  0.010446  0.034171
AE015929  2499279  St.epidermidis             0.003767  0.001716  0.004630  0.002240  0.002595  0.006244  0.011690  0.038364
AE009948  2160267  St.agalactiae              0.006904  0.002936  0.005024  0.001858  0.002747  0.005824  0.013064  0.044783
AL732656  2211485  St.agalactiae              0.007217  0.002855  0.005103  0.001912  0.002754  0.005794  0.012886  0.044199
AE014133  2030921  St.mutans                  0.013456  0.004012  0.005658  0.002547  0.002924  0.005414  0.012257  0.044677
AE007317  2038615  St.pneumoniae              0.011668  0.004819  0.005481  0.002696  0.003326  0.006172  0.013436  0.046620
AE005672  2160837  St.pneumoniae              0.011575  0.004695  0.005364  0.002744  0.003515  0.007163  0.014714  0.047388
AE004092  1852441  St.pyogenes                0.010333  0.004626  0.004952  0.002287  0.002627  0.005484  0.013313  0.048073
AE014074  1900521  St.pyogenes                0.010106  0.004583  0.004939  0.002203  0.002613  0.005468  0.013328  0.048020
BA000034  1895017  St.pyogenes                0.010166  0.004620  0.004978  0.002213  0.002580  0.005414  0.013187  0.047412
AE009949  1894275  St.pyogenes                0.010180  0.004543  0.004911  0.002241  0.002626  0.005417  0.013576  0.048444
BA000030  9025608  St.avermitilis             0.011393  0.010642  0.010511  0.007005  0.006183  0.009745  0.006162  0.012784
AL645882  8667507  St.coelicolor              0.011072  0.011860  0.012735  0.007970  0.007255  0.011645  0.006942  0.013581
BX548020  2434428  Synechococcus sp.          0.017464  0.011350  0.007426  0.005454  0.003216  0.007177  0.011822  0.038270
BA000022  3573470  Synechocystis sp.          0.023278  0.004463  0.008626  0.009390  0.006909  0.006308  0.009156  0.026581
AE008691  2689445  Th.tengcongensis           0.018306  0.007426  0.006789  0.002851  0.003094  0.006214  0.011581  0.038095
BA000039  2593857  Th.elongatus               0.015134  0.004435  0.008833  0.008839  0.007238  0.007249  0.013275  0.040696
AE000512  1860725  Th.maritima                0.028543  0.017235  0.006157  0.003558  0.003491  0.006186  0.013606  0.047372
AE017221  1894877  Th.thermophilus            0.030010  0.028048  0.020868  0.021224  0.014340  0.019560  0.020008  0.040026
AE017226  2843201  T.denticola                0.020824  0.012443  0.007493  0.003559  0.002909  0.005231  0.009348  0.031935
AE000520  1138011  T.pallidum                 0.008565  0.013489  0.004322  0.002325  0.002653  0.006578  0.018883  0.077105
AE014184   925938  T.whipplei                 0.005901  0.006857  0.003644  0.002028  0.004167  0.008065  0.026251  0.103984
BX072543   927303  T.whipplei                 0.005888  0.006799  0.003663  0.002031  0.004196  0.008171  0.026376  0.103937
AF222894   751719  U.urealyticum              0.012392  0.005304  0.008392  0.003739  0.005607  0.013331  0.030211  0.081295
AE003852  2961149  V.cholerae, chr.I          0.012525  0.008069  0.003458  0.005102  0.003751  0.004211  0.009380  0.031341
AE003853  1072315  V.cholerae, chr.II         0.013503  0.006883  0.003994  0.005358  0.005951  0.010563  0.026775  0.088844
BA000031  3288558  V.parahaemolyticus, chr.I  0.010957  0.006585  0.003386  0.003249  0.002822  0.003810  0.008719  0.028953
BA000032  1877212  V.parahaemolyticus, chr.2  0.012741  0.005847  0.003816  0.003567  0.003113  0.004436  0.012061  0.045329
AE016795  3281945  V.vulnificus, chr.I        0.012698  0.006581  0.003817  0.003317  0.002780  0.003857  0.009136  0.029434
AE016796  1844853  V.vulnificus, chr.II       0.014850  0.006728  0.004751  0.004073  0.003228  0.004387  0.012116  0.045682
BA000037  3354505  V.vulnificus, chr.I        0.012488  0.006641  0.003806  0.003326  0.002752  0.003844  0.009066  0.029050
BA000038  1857073  V.vulnificus, chr.II       0.014921  0.006935  0.004853  0.004106  0.003306  0.004377  0.012018  0.045477
BA000021   697724  W.glossinidia              0.014766  0.007419  0.006667  0.002547  0.005068  0.011597  0.029992  0.080185
AE017196  1267782  W. endosymbiont            0.009537  0.003220  0.003734  0.002173  0.004236  0.009704  0.024983  0.077065
BX571656  2110355  W.succinogenes             0.028493  0.021856  0.012718  0.008349  0.006376  0.006379  0.013819  0.043435
AE008923  5175554  X.axonopodis               0.024169  0.014838  0.007583  0.008203  0.004462  0.008442  0.008844  0.020333
AE008922  5076188  X.campestris               0.023799  0.014241  0.008130  0.008585  0.004921  0.009403  0.009628  0.022067
AE003849  2679306  X.fastidiosa               0.011896  0.003955  0.004727  0.003349  0.003094  0.004914  0.009535  0.033301
AE009442  2519802  X.fastidiosa               0.011900  0.004018  0.004997  0.003470  0.003060  0.005249  0.010488  0.036900
AE017042  4595065  Y.pestis                   0.009359  0.008596  0.005810  0.003897  0.003320  0.003660  0.007726  0.024715
AE009952  4600755  Y.pestis                   0.009320  0.008555  0.005798  0.003861  0.003332  0.003717  0.008030  0.025373
AL590842  4653728  Y.pestis                   0.009318  0.008568  0.005805  0.003873  0.003394  0.003911  0.008721  0.026808

A. Markov models and maximum entropy principle

The study of nucleotide sequences by means of Markov
processes is rather popular [23, 42]. The motivation behind
such popularity is quite transparent: a Markov process
provides a researcher with numerous ways to fit a specific
realization of the process to a given symbol sequence.
The basic idea of the search for so-called hidden Markov
models of a nucleotide sequence consists in the choice of
the minimal-order Markov process that matches the sequence
satisfactorily. It should be said that this approach
distinguishes coding regions of a genome from non-coding
ones quite properly [...]. The invariance of the triplet
distribution found for non-coding regions, together with
the distinct and well-structured pattern of the triplet
distribution observed within coding regions [...], makes the
efficiency of Markov models in separating coding from
non-coding regions evident.

Formally speaking, there always exists a Markov process
that fits a sequence perfectly. Indeed, a Markov process
of order d* developed over a sequence would match the
latter perfectly, with no variations at all.
Obviously, such a Markov model brings no biological
insight. Nevertheless, a search for the minimal-order Markov
process matching a sequence may make sense. The key
idea of separating coding regions from non-coding
ones by means of Markov models consists in seeking the
points of an abrupt change in the order of the relevant
process. A finer and more effective approach extending the
hidden Markov model is discussed in Section IV C.
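
The search for a minimal-order Markov process can be illustrated by the sketch below. It fits chains of increasing order by counting contexts and scores them with the Bayesian information criterion (BIC). The choice of BIC is an assumption made here for illustration; the text does not say which criterion decides that a process matches a sequence "satisfactorily", and all function names are hypothetical.

```python
from collections import Counter
from math import log


def markov_log_likelihood(seq, k):
    """Maximized log-likelihood of seq[k:] under an order-k Markov chain."""
    ctx_next = Counter((seq[i - k:i], seq[i]) for i in range(k, len(seq)))
    ctx = Counter()
    for (context, _), n in ctx_next.items():
        ctx[context] += n
    # Each transition probability is estimated as count(context, next) / count(context).
    return sum(n * log(n / ctx[context]) for (context, _), n in ctx_next.items())


def bic(seq, k, alphabet_size=4):
    """Bayesian information criterion of an order-k chain (lower is better)."""
    n_params = alphabet_size ** k * (alphabet_size - 1)
    return -2.0 * markov_log_likelihood(seq, k) + n_params * log(len(seq) - k)


def minimal_order(seq, max_order):
    """Order in 0..max_order with the smallest BIC."""
    return min(range(max_order + 1), key=lambda k: bic(seq, k))


print(minimal_order("ACGT" * 50, 3))  # the periodic sequence is a first-order chain
```

Higher orders fit the periodic sequence equally well, but the penalty on the number of parameters, which grows as 4^k, makes the first-order chain the preferred one. This mirrors the remark that a chain of sufficiently high order always fits perfectly and is therefore uninformative.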



B. On a fractal structure of genomes 

Comprehensive investigations of the statistical properties
of nucleotide sequences reveal some interesting (and
important) features of the latter. Researchers identify
various fractal structures and fractal-like patterns within
genetic entities [43, 44]. Probably, nucleotide sequences
are quite complex objects exhibiting a great variety of
properties, including those suspected to be
a fractal pattern. A study of information capacity reveals
an increased correlation in the combinations of various
strings through the non-monotonic behaviour of the
capacity observed at different word lengths q.

Examination of the tables shows the presence of
genomes that exhibit one, two or three local minima of
information capacity (10). A local minimum observed at
the length l means that some combinations of two words
of that length prevail over the others. Correlations
in the occurrence of short strings are evident if several
minima of information capacity are observed. The
correlations among short strings are less evident if a
single minimum is observed within a sequence. The point is
that such a single minimum may result from the finite
sampling of a sequence.
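
The local minima can be read off a table row mechanically. The sketch below does so for the Acinetobacter sp. row (entry CR543861), with the tabulated digits read as decimal fractions, which is an assumption about the lost leading "0."; the function name is illustrative.

```python
def local_minima(values, q_start=2):
    """Word lengths q at which `values` has a strict interior local minimum."""
    return [q_start + i for i in range(1, len(values) - 1)
            if values[i] < values[i - 1] and values[i] < values[i + 1]]


# Information capacity for q = 2..9, Acinetobacter sp. (CR543861).
acinetobacter = [0.013695, 0.006754, 0.003180, 0.003494,
                 0.003362, 0.003889, 0.007593, 0.025350]
print(local_minima(acinetobacter))  # [4, 6]
```

This genome thus belongs to the group with two local minima, at q = 4 and q = 6.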

Nonetheless, one could hardly explain the occurrence of
a single minimum of information capacity (10) by the finite
sampling effect alone. The point is that the location of
such a single minimum varies significantly among
bacterial genomes. Suppose the location of the single
minimum of information capacity is determined by the finite
sampling effect; then the specific word length q where
the minimum is observed should be the same for all such
genomes. This follows from the behaviour of the
maximum of (10): the former is defined mainly by the
finite sampling effect. Basically, the finite sampling
effect would manifest itself through a logarithmic
dependence of the position of the maximum (and, in turn,
of the minimum) on the length of a sequence. The observed
diversity of the lengths where the minimum occurs refutes
the original supposition. Thus, the local minima
(regardless of how many of them are observed within
a sequence) represent a structure that might be
considered a fractal pattern; a detailed discussion of that
matter falls beyond the scope of the paper.
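
The argument can be checked on two table rows of nearly equal length. The sketch below takes log_4 N as a proxy for the word length at which finite sampling sets in; this proxy is an assumption of the example, not a formula from the paper. The values are the E.coli row (entry AE014075) and a Y.pestis row (entry AE017042), read as decimal fractions.

```python
from math import log


def saturation_length(n, alphabet_size=4):
    """Word length at which the number of possible words equals the sequence length n."""
    return log(n) / log(alphabet_size)


def q_of_minimum(values, q_start=2):
    """Word length at which the tabulated capacity is smallest."""
    return q_start + values.index(min(values))


e_coli = (4639675, [0.011808, 0.012713, 0.008453, 0.004938,
                    0.004337, 0.003890, 0.006233, 0.018843])
y_pestis = (4595065, [0.009359, 0.008596, 0.005810, 0.003897,
                      0.003320, 0.003660, 0.007726, 0.024715])

for n, capacity in (e_coli, y_pestis):
    print(round(saturation_length(n), 2), q_of_minimum(capacity))
```

Both genomes have practically the same log_4 N (about 11.07), yet the minimum falls at q = 7 for the first and at q = 6 for the second, which is the kind of diversity the paragraph refers to.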

C. Information valuable words 

Let us have a look at the definition of information
capacity (10). It is evident that the major contribution to
the sum is provided by the terms with the highest
deviation of the real frequency $f_\omega$ from the most
expected one $\tilde f_\omega$. These are the words of
increased information value. More exactly, let $\alpha$,
$\alpha > 1$, be the information value threshold. A word
$\omega$ is of information value if it falls out of the
range determined by the double inequality

$$\alpha^{-1} < \frac{f_\omega}{\tilde f_\omega} < \alpha . \qquad (11)$$

There are two types of information valuable words: the
former are the words with an excess of the real frequency
over the expected one, and the latter are the words with an
excess of the expected frequency over the real one. We call
the words of the first type the ascending ones and the
words of the second type the descending ones.
Whether a word $\omega$ is of information value or not
depends on the structure of a sequence, on the threshold
$\alpha$, and on the length q of the word.
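
A minimal sketch of the selection rule follows. It assumes that the information value is the ratio p of the real frequency to the most expected one, and that the latter is rebuilt from the thinner dictionary as f(i1...iq-1) f(i2...iq) / f(i2...iq-1); both the reconstruction formula and the ring reading of the sequence are assumptions of the example, and the function names are illustrative.

```python
from collections import Counter


def freq_dict(seq, q):
    """Frequencies of words of length q, reading seq as a ring (an assumption)."""
    ring = seq + seq[:q - 1]
    counts = Counter(ring[i:i + q] for i in range(len(seq)))
    return {w: c / len(seq) for w, c in counts.items()}


def valuable_words(seq, q, alpha):
    """Split observed words of length q into ascending and descending ones."""
    w_q, w_prev = freq_dict(seq, q), freq_dict(seq, q - 1)
    w_core = freq_dict(seq, q - 2) if q > 2 else {}
    ascending, descending = {}, {}
    # Only observed words are scanned; absent words (p = 0) are not listed.
    for word, f in w_q.items():
        core = w_core[word[1:-1]] if q > 2 else 1.0
        p = f * core / (w_prev[word[:-1]] * w_prev[word[1:]])
        if p > alpha:
            ascending[word] = p
        elif p < 1.0 / alpha:
            descending[word] = p
    return ascending, descending


up, down = valuable_words("AAAACCCC", 2, alpha=1.4)
print(sorted(up), sorted(down))  # ['AA', 'CC'] ['AC', 'CA']
```

In this toy ring the homogeneous pairs occur 1.5 times more often than expected and the mixed pairs half as often, so with the threshold 1.4 every observed word is of information value.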

Of course, the choice of the value of $\alpha$ is still a
matter of the expertise of a researcher. There is no formal
way to set the $\alpha$ level. To clarify this point, one
has to study the distribution of the words of a dictionary
$W_q$ over their information value $p = f_\omega / \tilde f_\omega$.
While the expected frequency $\tilde f_\omega$ is explicitly
derived from the real frequencies of the words (see its
definition above), less is known about the distribution of
words over the real (and, in turn, expected) frequency.

Obviously, $p_{\max}$ and $p_{\min}$ depend on the structure
of a nucleotide sequence. An estimate for $p_{\min}$ is
apparent: $\min p_{\min} = 0$; less is known concerning the
estimate of $p_{\max}$. To clarify this point, more studies
should be carried out; they fall beyond the scope of this
paper.

Suppose the threshold value $\alpha$ is set. The threshold
identifies two sets of information valuable words; the
former is the ascending one, and the latter is the
descending one. The property of being an information
valuable word is not monotonic: a given ascending
information valuable word $\bar\omega$ (descending word
$\underline{\omega}$, respectively) of the length q may or
may not be embedded into a longer one. Moreover, if no
embedding is found among the information valuable words one
symbol longer, one cannot guarantee the absence of an
embedding into the information valuable words of the length
l, where l > q + 1. Besides, a longer information valuable
word may incorporate a shorter one with the opposite order
of p.

Consider the uniform sets of information valuable
words of increasing length,

$$\{\bar\omega_3\}, \{\bar\omega_4\}, \ldots, \{\bar\omega_q\} \quad \text{and} \quad \{\underline{\omega}_3\}, \{\underline{\omega}_4\}, \ldots, \{\underline{\omega}_q\},$$

identified for a given $\alpha > 1$. A chain

$$\bar\omega_3 \subset \bar\omega_4 \subset \ldots \subset \bar\omega_k \qquad (12a)$$

or

$$\underline{\omega}_3 \subset \underline{\omega}_4 \subset \ldots \subset \underline{\omega}_k \qquad (12b)$$

is the ascending shoot or the descending shoot, respectively.
The shortest word within a shoot (12) is a root, and the
longest one is an apex. A union of all the shoots with
the same root makes a pyramid. Thus, a pyramid may
be an ascending or a descending one. An information
valuable word is an entity where a given Markov model
changes for another one (of another order, etc.). A
pyramid gathers the entities where the variation of the
relevant Markov model takes place in coordination, for all
the scales 3 ≤ q ≤ k. Obviously, a simultaneous change
of a Markov model could hardly take place by chance.
Hence, the apices of the pyramids (12) identify the sites
within a genome.
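
The construction of shoots and pyramids can be sketched as a search over nested words. The example assumes that the embedding in (12) means that a word occurs as a substring of the word one symbol longer, and that the sets of information valuable words of one type are already known for each length; the data and function names are hypothetical.

```python
def shoots(valuable, root_len=3):
    """Enumerate maximal chains of nested valuable words, starting at root_len.

    `valuable` maps a word length to the set of information valuable words
    of one type (all ascending or all descending) of that length.
    """
    def extend(chain):
        longer = [w for w in sorted(valuable.get(len(chain[-1]) + 1, ()))
                  if chain[-1] in w]
        if not longer:
            yield chain  # the last word of the chain is the apex
        for w in longer:
            yield from extend(chain + [w])

    for root in sorted(valuable.get(root_len, ())):
        yield from extend([root])


def pyramids(valuable, root_len=3):
    """Group the shoots by their root: root -> list of shoots."""
    result = {}
    for chain in shoots(valuable, root_len):
        result.setdefault(chain[0], []).append(chain)
    return result


# Hypothetical ascending words of lengths 3, 4 and 5.
ascending = {3: {"ACG", "TTT"}, 4: {"ACGT", "GACG"}, 5: {"ACGTA"}}
for root, chains in pyramids(ascending).items():
    print(root, [chain[-1] for chain in chains])
```

Here the root ACG carries a pyramid of two shoots with the apices ACGTA and GACG, while TTT is a degenerate pyramid whose root is also its apex.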

A study of the distribution of the apices along a
genome [21] shows a high level of correlation between
the location of the apices and the functional role
of the sites where they occur. Such a study makes the core
of a very promising approach to investigating the
relation between structure and function of biological
macromoleculae, while a detailed discussion of that subject
falls beyond the scope of this paper.